{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zifyYw-ollsY",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Using Evidently to Evaluate Data Drift for a Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GLo2a7W2llse",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "This notebook shows how you can use Evidently to check a dataset for data drift.\n",
    "\n",
    "Acknowledgments:\n",
    "\n",
    "- The dataset used in the example is from: https://www.kaggle.com/c/bike-sharing-demand/data?select=train.csv\n",
    "- Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg\n",
    "- More information about the dataset can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Eg-Ddrh7llsf"
   },
   "source": [
    "## Getting Started\n",
    "To run this tutorial:\n",
    "\n",
    "1. Install MLflow\n",
    "You can install MLflow with `pip install mlflow`, or install it together with scikit-learn via `pip install mlflow[extras]`.\n",
    "More details: https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#id5\n",
    "\n",
    "2. Install Evidently\n",
    "You can install Evidently with `pip install evidently`.\n",
    "More details: https://docs.evidentlyai.com/install-evidently\n",
    "\n",
    "3. Optionally, download the data from https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset and save it locally, or skip this step and download the data with `requests` using the instructions below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0qLqOlt9llsg",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import requests\n",
    "import zipfile\n",
    "import io\n",
    "\n",
    "import plotly.offline as py #working offline\n",
    "import plotly.graph_objs as go\n",
    "\n",
    "from evidently.pipeline.column_mapping import ColumnMapping\n",
    "from evidently.report import Report\n",
    "from evidently.metric_preset import DataDriftPreset\n",
    "\n",
    "import mlflow\n",
    "import mlflow.sklearn\n",
    "from mlflow.tracking import MlflowClient"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LlKbn-2ullsj",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "warnings.simplefilter('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "odU8_XKlllsk",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "py.init_notebook_mode()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7lKnwPO-llsk",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "B3x2-w2Illsl",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "content = requests.get(\"https://archive.ics.uci.edu/static/public/275/bike+sharing+dataset.zip\").content\n",
    "with zipfile.ZipFile(io.BytesIO(content)) as arc:\n",
    "    raw_data = pd.read_csv(arc.open(\"day.csv\"), header=0, sep=',', parse_dates=['dteday'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "W0mG3lSQllsm",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#observe data structure\n",
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jV3J8Egcllsm",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#set column mapping for the Evidently Report\n",
    "data_columns = ColumnMapping()\n",
    "data_columns.datetime = 'dteday'\n",
    "data_columns.numerical_features = ['weathersit', 'temp', 'atemp', 'hum', 'windspeed']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QM1wTcUwllsn",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#set reference dates\n",
    "reference_dates = ('2011-01-01 00:00:00','2011-01-28 23:00:00')\n",
    "\n",
    "#set experiment batches dates\n",
    "experiment_batches = [\n",
    "    ('2011-02-01 00:00:00','2011-02-28 23:00:00'),\n",
    "    ('2011-03-01 00:00:00','2011-03-31 23:00:00'),\n",
    "    ('2011-04-01 00:00:00','2011-04-30 23:00:00'),\n",
    "    ('2011-05-01 00:00:00','2011-05-31 23:00:00'),  \n",
    "    ('2011-06-01 00:00:00','2011-06-30 23:00:00'), \n",
    "    ('2011-07-01 00:00:00','2011-07-31 23:00:00'), \n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qs1g2R93llsn",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Functions to calculate drift with Evidently"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AG4b_lfxllsn"
   },
   "outputs": [],
   "source": [
    "data_drift_report = Report(metrics=[DataDriftPreset()])\n",
    "data_drift_report.run(reference_data=raw_data[:100], current_data=raw_data[100:200], column_mapping=data_columns)\n",
    "report = data_drift_report.as_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dL6HzIkSllso"
   },
   "outputs": [],
   "source": [
    "report[\"metrics\"][1][\"result\"][\"drift_by_columns\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-l4ohZ7Allsp",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#evaluate dataset-level data drift with an Evidently Report\n",
    "def detect_dataset_drift(reference, production, column_mapping, get_ratio=False):\n",
    "    \"\"\"\n",
    "    Returns True if Data Drift is detected, else returns False.\n",
    "    If get_ratio is True, returns the share of drifted features.\n",
    "    The Data Drift detection depends on the confidence level and the threshold.\n",
    "    For each individual feature Data Drift is detected with the selected confidence (default value is 0.95).\n",
    "    Data Drift for the dataset is detected if share of the drifted features is above the selected threshold (default value is 0.5).\n",
    "    \"\"\"\n",
    "    \n",
    "    data_drift_report = Report(metrics=[DataDriftPreset()])\n",
    "    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)\n",
    "    report = data_drift_report.as_dict()\n",
    "    \n",
    "    if get_ratio:\n",
    "        return report[\"metrics\"][0][\"result\"][\"drift_share\"]\n",
    "    else:\n",
    "        return report[\"metrics\"][0][\"result\"][\"dataset_drift\"]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k59pXtSwllsp",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#evaluate per-feature data drift with an Evidently Report\n",
    "def detect_features_drift(reference, production, column_mapping, get_scores=False):\n",
    "    \"\"\"\n",
    "    Returns a list of (feature, drift_detected) tuples, where drift_detected is True if Data Drift is detected for the feature.\n",
    "    If get_scores is True, returns (feature, drift_score) tuples instead, where drift_score is the score value (like p-value) for the feature.\n",
    "    The Data Drift detection depends on the confidence level.\n",
    "    For each individual feature Data Drift is detected with the selected confidence (default value is 0.95).\n",
    "    \"\"\"\n",
    "    \n",
    "    data_drift_report = Report(metrics=[DataDriftPreset()])\n",
    "    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)\n",
    "    report = data_drift_report.as_dict()\n",
    "    \n",
    "    drifts = []\n",
    "    num_features = column_mapping.numerical_features if column_mapping.numerical_features else []\n",
    "    cat_features = column_mapping.categorical_features if column_mapping.categorical_features else []\n",
    "    for feature in num_features + cat_features:\n",
    "        drift_score = report[\"metrics\"][1][\"result\"][\"drift_by_columns\"][feature][\"drift_score\"]\n",
    "        if get_scores:\n",
    "            drifts.append((feature, drift_score))\n",
    "        else:\n",
    "            drifts.append((feature, report[\"metrics\"][1][\"result\"][\"drift_by_columns\"][feature][\"drift_detected\"]))\n",
    "             \n",
    "    return drifts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-arYuQRellsq",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Features Drift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wN4ttBrYllsq",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "features_historical_drift = []\n",
    "\n",
    "for date in experiment_batches:\n",
    "    drifts = detect_features_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0],reference_dates[1])], \n",
    "                           raw_data.loc[raw_data.dteday.between(date[0], date[1])], \n",
    "                           column_mapping=data_columns)\n",
    "    \n",
    "    features_historical_drift.append([x[1] for x in drifts])\n",
    "    \n",
    "features_historical_drift_frame = pd.DataFrame(features_historical_drift, \n",
    "                                               columns = data_columns.numerical_features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Zt7zceZollsq",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "fig = go.Figure(data=go.Heatmap(\n",
    "                   z = features_historical_drift_frame.astype(int).transpose(),\n",
    "                   x = [x[1] for x in experiment_batches],\n",
    "                   y = data_columns.numerical_features,\n",
    "                   hoverongaps = False,\n",
    "                   xgap = 1,\n",
    "                   ygap = 1,\n",
    "                   zmin = 0,\n",
    "                   zmax = 1,\n",
    "                   showscale = False,\n",
    "                   colorscale = 'Bluered'\n",
    "))\n",
    "\n",
    "fig.update_xaxes(side=\"top\")\n",
    "\n",
    "fig.update_layout(\n",
    "    xaxis_title = \"Timestamp\",\n",
    "    yaxis_title = \"Feature Drift\"\n",
    ")\n",
    "\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-gsQQiB9llsr",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "features_historical_drift_pvalues = []\n",
    "\n",
    "for date in experiment_batches:\n",
    "    drifts = detect_features_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], \n",
    "                           raw_data.loc[raw_data.dteday.between(date[0], date[1])],\n",
    "                           column_mapping=data_columns,\n",
    "                           get_scores=True)\n",
    "    \n",
    "    features_historical_drift_pvalues.append([x[1] for x in drifts])\n",
    "    \n",
    "features_historical_drift_pvalues_frame = pd.DataFrame(features_historical_drift_pvalues, \n",
    "                                                       columns = data_columns.numerical_features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "65_nOkLallsr",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "fig = go.Figure(data=go.Heatmap(\n",
    "                   z = features_historical_drift_pvalues_frame.transpose(),\n",
    "                   x = [x[1] for x in experiment_batches],\n",
    "                   y = features_historical_drift_pvalues_frame.columns,\n",
    "                   hoverongaps = False,\n",
    "                   xgap = 1,\n",
    "                   ygap = 1,\n",
    "                   zmin = 0,\n",
    "                   zmax = 1,\n",
    "                   colorscale = 'reds_r'\n",
    "                   )\n",
    "               )\n",
    "\n",
    "fig.update_xaxes(side=\"top\")\n",
    "\n",
    "fig.update_layout(\n",
    "    xaxis_title = \"Timestamp\",\n",
    "    yaxis_title = \"p-value\"\n",
    ")\n",
    "\n",
    "fig.show(\"notebook\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bJat2lIAllss",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Dataset Drift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5-ZV7Ug4llss",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "dataset_historical_drift = []\n",
    "\n",
    "for date in experiment_batches:\n",
    "    dataset_historical_drift.append(detect_dataset_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], \n",
    "                           raw_data.loc[raw_data.dteday.between(date[0], date[1])], \n",
    "                           column_mapping=data_columns))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iXFMLPXDllsv",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "fig = go.Figure(data=go.Heatmap(\n",
    "                   z = [[int(x) for x in dataset_historical_drift]],\n",
    "                   x = [x[1] for x in experiment_batches],\n",
    "                   y = [''],\n",
    "                   hoverongaps = False,\n",
    "                   xgap = 1,\n",
    "                   ygap = 1,\n",
    "                   zmin = 0,\n",
    "                   zmax = 1,\n",
    "                   colorscale = 'Bluered',\n",
    "                   showscale = False\n",
    "                   )\n",
    "               )\n",
    "\n",
    "fig.update_xaxes(side=\"top\")\n",
    "\n",
    "fig.update_layout(\n",
    "    xaxis_title = \"Timestamp\",\n",
    "    yaxis_title = \"Dataset Drift\"\n",
    ")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WEgYspQ8llsw",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "dataset_historical_drift_ratio = []\n",
    "\n",
    "for date in experiment_batches:\n",
    "    dataset_historical_drift_ratio.append(detect_dataset_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], \n",
    "                           raw_data.loc[raw_data.dteday.between(date[0],date[1])],\n",
    "                           column_mapping=data_columns,\n",
    "                           get_ratio=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "y834xm1lllsw",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "fig = go.Figure(data=go.Heatmap(\n",
    "                   z = [dataset_historical_drift_ratio],\n",
    "                   x = [x[1] for x in experiment_batches],\n",
    "                   y = [''],\n",
    "                   hoverongaps = False,\n",
    "                   xgap = 1,\n",
    "                   ygap = 1,\n",
    "                   zmin = 0.5,\n",
    "                   zmax = 1,\n",
    "                   colorscale = 'reds'\n",
    "                  )\n",
    "               )\n",
    "\n",
    "fig.update_xaxes(side=\"top\")\n",
    "\n",
    "fig.update_layout(\n",
    "    xaxis_title = \"Timestamp\",\n",
    "    yaxis_title = \"Dataset Drift\"\n",
    ")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bqkWw_uCllsx",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Log Dataset Drift in MLflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_NWT4mHullsx",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#log into MLflow\n",
    "client = MlflowClient()\n",
    "\n",
    "#set experiment\n",
    "mlflow.set_experiment('Dataset Drift Analysis with Evidently')\n",
    "\n",
    "#start new run\n",
    "for date in experiment_batches:\n",
    "    with mlflow.start_run() as run: \n",
    "        \n",
    "        # Log parameters\n",
    "        mlflow.log_param(\"begin\", date[0])\n",
    "        mlflow.log_param(\"end\", date[1])\n",
    "\n",
    "        # Log metrics\n",
    "        metric = detect_dataset_drift(raw_data.loc[raw_data.dteday.between(reference_dates[0], reference_dates[1])], \n",
    "                           raw_data.loc[raw_data.dteday.between(date[0], date[1])],\n",
    "                           column_mapping=data_columns,\n",
    "                           get_ratio=True)\n",
    "        \n",
    "        mlflow.log_metric('dataset drift', metric)\n",
    "\n",
    "        print(run.info)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rBbyPiVWllsx",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "#run MLflow UI (it will be more convenient to run it directly from the terminal)\n",
    "#!mlflow ui"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Support Evidently\n",
    "Enjoyed the tutorial? Star Evidently on GitHub to contribute back! This helps us continue creating free open-source tools for the community. https://github.com/evidentlyai/evidently"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
